Abstract. In a recent paper from SODA11 [1] the authors introduced a
general framework for exponential time improvement of min-wise based
algorithms by defining and constructing an almost k-min-wise independent
family of hash functions. Here we take it a step further and reduce
the space and the independence needed for representing the functions,
by defining and constructing a d-k-min-wise independent family of hash
functions. Surprisingly, for most cases only 8-wise independence is needed
for exponential time and space improvement. Moreover, we bypass the
$O(\log\frac{1}{\epsilon})$ independence lower bound for approximately
min-wise functions [2], as we use an alternative definition. In addition,
as the degree of independence is a small constant, it can be implemented
efficiently.
Informally, under this definition, all subsets of size d of any fixed set
X have an equal probability to have hash values among the minimal k
values in X, where the probability is over the random choice of hash
function from the family. This property measures the randomness of
the family, as choosing a truly random function, obviously, satisfies the
definition for $d = k = |X|$. We define and give a time- and
space-efficient construction of an approximately d-k-min-wise independent
family of hash functions. The degree of independence required is optimal,
i.e. only $O(d)$ for $2 \le d < k = O(\frac{d}{\epsilon^2})$, where
$\epsilon \in (0, 1)$ is the desired error bound. This construction can be
used to improve many min-wise based algorithms, such as
[3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], as will be discussed here.
To our knowledge such definitions, for hash functions, were never studied
and no construction was given before.



1 Introduction 

Hash functions are fundamental building blocks of many algorithms. They map 
values from one domain to another, usually smaller. Although they have been 
studied for many years, designing hash functions is still a hot topic in modern 
research. In a perfect world we could use a truly random hash function, one that 
would be chosen randomly out of all the possible mappings. 

Specifically, consider the domain of all hash functions $h : N \to M$, where
$|N| = n$ and $|M| = m$. As we need to map each of the n elements in the
source into one of the m possible values, the number of bits needed to
maintain each function is $n \log m$. Since nowadays we often have massive
amounts of data to process, this amount of space is not feasible.
Nevertheless, most algorithms do not really need such a high level of
randomness, and can perform well enough with some relaxations. In such cases
one can use a much smaller domain of hash functions. A smaller domain implies
a smaller space requirement at the price of a lower level of randomness.

As an illustrative example, a 2-wise independent family of hash functions
guarantees that the hash values of each pair of elements are independent.
It is known that only $2 \log m$ bits are enough in order to choose and
maintain such a function out of the family.
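
To make the 2-wise independent example concrete, here is a minimal sketch of the classic linear construction $h(x) = ((ax + b) \bmod p) \bmod m$. It is a common textbook family and is not taken from this paper; the prime `P`, the function name `make_pairwise_hash`, and the final reduction modulo `m` (which makes the output only approximately uniform) are assumptions of the sketch.

```python
import random

# Assumption: a Mersenne prime larger than any key we intend to hash.
P = (1 << 61) - 1

def make_pairwise_hash(m, rng=random):
    """Draw one function h(x) = ((a*x + b) mod P) mod m.

    Only the two coefficients (a, b) need to be stored to represent h.
    """
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)

    def h(x):
        return ((a * x + b) % P) % m

    return h, (a, b)
```

The point of the example is the storage cost: the whole function is described by two numbers rather than a table with one entry per element.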

This work is focused on the area of min-hashing. One derivative of
min-hashing is min-wise independent permutations, which were first
introduced in [19,20]. A family of permutations $F \subseteq S_n$ (where
$S_n$ is the symmetric group) is min-wise independent if for any set
$X \subseteq [n]$ (where $[n] = \{0, \ldots, n-1\}$) and any $x \in X$,
where $\pi$ is chosen uniformly at random in $F$, we have:

$$\Pr[\min\{\pi(X)\} = \pi(x)] = \frac{1}{|X|}$$

Similarly, a family of functions $\mathcal{H} \subseteq [n] \to [n]$ (where
$[n] = \{0, \ldots, n-1\}$) is called min-wise independent if for any
$X \subseteq [n]$, and for any $x \in X$, where h is chosen uniformly at
random in $\mathcal{H}$, we have:

$$\Pr_{h \in \mathcal{H}}[\min\{h(X)\} = h(x)] = \frac{1}{|X|}$$

Min hashing is a widely used tool for solving problems in computer science
such as estimating similarity [20,21,22], rarity [3], transitive closure [5],
web page duplicate detection [5,11,17,18], sketching techniques [8,7], and
other data mining problems [6,14,16,9,10].

One of the key properties of min hashing is that it enables sampling the
universe of the elements being hashed. This is because each element, over the
random choice of hash function out of the family, has equal probability of
being mapped to the minimal value, regardless of the number of occurrences of
the element. Thus, by maintaining the element with the minimal hash value over
the input, one can sample the universe.
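
The sampling property described above can be sketched in a few lines. The function name `min_hash_sample` is hypothetical, and `h` stands for any hash function drawn from a (approximately) min-wise independent family; the sketch only illustrates that repeated occurrences of an element do not affect the outcome.

```python
def min_hash_sample(stream, h):
    """Return the element of the stream with the smallest hash value.

    Duplicates hash to the same value, so the result depends only on the
    set of distinct elements, not on how often each one occurs.
    """
    best = None
    for x in stream:
        if best is None or h(x) < h(best):
            best = x
    return best
```

Because only one element and its hash value are kept, the sample is maintained in constant space over a stream of any length.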

Similarity estimation of data sets is a fundamental tool in mining data. It
is often calculated using the Jaccard similarity coefficient which is defined
by $\frac{|A \cap B|}{|A \cup B|}$, where A and B are two data sets. By
maintaining the minimal hash value over two sets of data inputs A and B, the
probability of getting the same hash value is exactly
$\frac{|A \cap B|}{|A \cup B|}$, which equals the Jaccard similarity
coefficient, as described in [20,21,22,3].
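
The following is a minimal sketch of this estimator, assuming integer elements and non-empty sets. The helper names are hypothetical, and the random linear hash used here is only 2-wise independent, so it is a stand-in for a genuinely (approximately) min-wise independent function rather than a construction from the literature cited above.

```python
import random

# Assumption: a prime larger than every element being hashed.
P = (1 << 61) - 1

def random_linear_hash(rng):
    """Stand-in hash: injective on [0, P), but only 2-wise independent."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * x + b) % P

def estimate_jaccard(A, B, num_hashes=200, seed=0):
    """Fraction of hash functions under which A and B share their minimum."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(num_hashes):
        h = random_linear_hash(rng)
        if min(A, key=h) == min(B, key=h):
            matches += 1
    return matches / num_hashes

def exact_jaccard(A, B):
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)
```

Each hash function contributes one sample, which is why the classical approach needs many functions; this is the cost that the k-min-wise and d-k-min-wise families aim to reduce.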

Indyk in [23] was the first to give a construction of a small approximately
min-wise independent family of hash functions. Another construction was
proposed in [21].

A family of functions $\mathcal{H} \subseteq [n] \to [n]$ is called
approximately min-wise independent, or $\epsilon$-min-wise independent, if,
for any $X \subseteq [n]$, and for any $x \in X$, where h is chosen uniformly
at random in $\mathcal{H}$, we have:

$$\Pr_{h \in \mathcal{H}}[\min\{h(X)\} = h(x)] = \frac{1}{|X|}(1 \pm \epsilon)$$

where $\epsilon \in (0, 1)$ is the desired error bound, and
$O(\log\frac{1}{\epsilon})$ independence is needed. Pătraşcu and Thorup
showed in [2] that $O(\log\frac{1}{\epsilon})$ independence is needed for
maintaining an approximately min-wise function, hence Indyk's construction is
optimal, and the minimal number of bits needed to represent each function is
$O(\log n \log\frac{1}{\epsilon})$.

In a previous paper [1] the authors defined and gave a construction for an
approximately k-min-wise ($\epsilon$-k-min-wise) independent family of hash
functions:

A family of functions $\mathcal{H} \subseteq [n] \to [n]$ (where
$[n] = \{0, \ldots, n-1\}$) is called $\epsilon$-k-min-wise independent if for
any $X \subseteq [n]$ and for any $Y \subseteq X$, $|Y| = k$, we have

$$\Pr_{h \in \mathcal{H}}[MIN_k(h(X)) = h(Y)] = \frac{1}{\binom{|X|}{k}}(1 \pm \epsilon)$$

where the function h is chosen uniformly at random from $\mathcal{H}$, and
$\epsilon \in (0, 1)$ is the error bound.

It was also shown in [1] that choosing uniformly at random from an
$O(k \log\log\frac{1}{\epsilon} + \log\frac{1}{\epsilon})$-wise independent
family of hash functions is approximately k-min-wise independent. In most
applications, k different approximately min-wise hash functions were used,
and they proposed to replace them with only one approximately k-min-wise hash
function. As the k elements are fully independent, the precision is
preserved. Furthermore, this reduces exponentially the running time and
asymptotically the space of previously known results for min-wise based
algorithms.

1.1 Our Contribution 

In this paper we define and construct a small approximately d-k-min-wise
independent family of hash functions. First, we extend the notion of a
min-wise independent family of hash functions by defining a d-k-min-wise
independent family of hash functions. Then, we show a construction of an
approximate such family. Under this definition, all subsets of size d of any
fixed set X have an equal probability to have hash values among the minimal k
values in X, where the probability is over the random choice of hash function
from the family. The formal definition is given in section 2. The degree of
independence and the space needed by our construction is $O(d)$, for
$2 \le d < k = O(\frac{d}{\epsilon^2})$, where $\epsilon \in (0, 1)$ is the
error bound. The construction is optimal, since by our definition the k
elements are d-wise independent. The dependency on d but not on k is
surprising; the intuition behind it is the stability of the k-th ranked
element for large enough k. Hence, the randomness needed is mainly for the
independence of the d elements.



We argue that for most applications it is sufficient to use constant d = 2.
This yields the need for only 8-wise independent hash functions, which can be
implemented efficiently. Our approach bypasses the
$O(\log\frac{1}{\epsilon})$ lower bound of approximately min-wise functions,
as this constraint is not required by our definition.

To utilize our construction we propose a simple and even better general
framework for exponential time improvement of min-wise based algorithms, such
as in [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]. What these algorithms have
in common is that they use $c > 1$ approximately min-wise hash functions in
order to sample c elements independently from the universe. We propose to
replace them with fewer (say $O(\log\frac{1}{\tau})$, for $\tau \in (0, 1)$)
approximately d-k-min-wise (for $k < c$) independent functions, where each
samples k elements. The k elements are d-wise independent, therefore we can
use Chebyshev's inequality to bound the precision. Specifically, pairwise
independence is sufficient for applying Chebyshev, and this is why $d = 2$ is
usually used. By combining the $\log\frac{1}{\tau}$ samples using the Chernoff
bound, the precision becomes as desired. The above procedure does not change
the algorithm itself, but only the way it samples, hence it is simple to
adapt. We found this to improve exponentially the time and space complexity
(as the space for each function is constant).
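
One plausible instantiation of this framework, for Jaccard estimation, is sketched below. It is an illustration under stated assumptions and not the authors' implementation: the 8-wise independent function is realized as a random degree-7 polynomial over a prime field (a standard construction), each function keeps the k elements with the smallest hash values, and the per-function estimates are combined by taking their median. All names (`make_poly_hash`, `bottom_k`, `estimate_jaccard`) and the parameter defaults are hypothetical; inputs are assumed to be non-empty sets of integers.

```python
import random
from statistics import median

# Assumption: a prime larger than every element being hashed.
P = (1 << 61) - 1

def make_poly_hash(independence, rng):
    """Random polynomial of degree independence-1 over Z_P."""
    coeffs = [rng.randrange(P) for _ in range(independence)]

    def h(x):
        acc = 0
        for c in coeffs:  # Horner evaluation
            acc = (acc * x + c) % P
        return acc

    return h

def bottom_k(S, h, k):
    """The k elements of S with the smallest hash values."""
    return set(sorted(S, key=h)[:k])

def jaccard_bottom_k(A, B, h, k):
    """Estimate |A & B| / |A | B| from one bottom-k sample of the union."""
    sample = bottom_k(A | B, h, k)
    return len(sample & A & B) / len(sample)

def estimate_jaccard(A, B, k=64, repetitions=5, seed=0):
    """Median of a few bottom-k estimates, each from its own hash function."""
    A, B = set(A), set(B)
    rng = random.Random(seed)
    estimates = [
        jaccard_bottom_k(A, B, make_poly_hash(8, rng), k)
        for _ in range(repetitions)
    ]
    return median(estimates)
```

Each function is described by eight coefficients regardless of k, which is the sense in which the space per function is constant. When the union has at most k elements, every sample is the whole union and the estimate is exact.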



1.2 Outline 

The outline of the paper is as follows. In section 2 we define the notion of
d-k-min-wise and approximately d-k-min-wise independent families. In section 3
we present a construction of such a family. Finally, a few lemmata are
deferred to the appendix.

2 Definitions 

We start by giving our definitions for exact and approximately d-k-min-wise
independent families of hash functions, which generalize min-wise independent
families of hash functions.

Definition 2.1. For any set X we define $MIN_k(X)$ to be the set of the k
smallest elements in X.

Definition 2.2. For any set X we define $RANK_k(X)$ to be the k-th element
in X, where the elements are sorted by value.

Definition 2.3. For any set X, and hash function h, we define $h(X)$ to be
the set of all hash values of all elements in X.
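
Definitions 2.1 to 2.3 translate directly into code. The sketch below uses hypothetical function names and assumes the values are comparable and that k does not exceed the size of the set.

```python
def min_k(values, k):
    """MIN_k: the set of the k smallest elements."""
    return set(sorted(values)[:k])

def rank_k(values, k):
    """RANK_k: the k-th element in sorted order (1-indexed)."""
    return sorted(values)[k - 1]

def hash_set(h, X):
    """h(X): the set of hash values of all elements of X."""
    return {h(x) for x in X}
```

For example, with `h = lambda x: (3 * x + 1) % 11` and `X = {0, 1, 2, 3}`, the hash values are 1, 4, 7 and 10, so `min_k(hash_set(h, X), 2)` contains 1 and 4.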

Definition 2.4. d-k-min-wise independent family of hash functions:
A family of functions $\mathcal{H} \subseteq [u] \to [u]$ (where
$[u] = \{0, \ldots, u-1\}$) is called d-k-min-wise independent if for any
$X \subseteq [u]$, $|X| = n - d$, for any $Y \subseteq [u]$, $|Y| = d$,
$X \cap Y = \emptyset$, $d \le k$, we have

$$\Pr_{h \in \mathcal{H}}[RANK_d(h(Y)) < RANK_{k-d+1}(h(X))] = \frac{\binom{k}{d}}{\binom{n}{d}}$$

where the function h is chosen uniformly at random from $\mathcal{H}$.

Definition 2.5. Approximately d-k-min-wise ($\epsilon$-d-k-min-wise)
independent family of hash functions:

A family of functions $\mathcal{H} \subseteq [u] \to [u]$ (where
$[u] = \{0, \ldots, u-1\}$) is called approximately d-k-min-wise independent
if for any $X \subseteq [u]$, $|X| = n - d$, for any $Y \subseteq [u]$,
$|Y| = d$, $X \cap Y = \emptyset$, $d \le k$, we have

$$\Pr_{h \in \mathcal{H}}[RANK_d(h(Y)) < RANK_{k-d+1}(h(X))] = \frac{\binom{k}{d}}{\binom{n}{d}}(1 \pm \epsilon)$$

where the function h is chosen uniformly at random from $\mathcal{H}$, and
$\epsilon \in (0, 1)$ is the error bound.



3 Construction 

In this section we provide a construction for an approximately d-k-min-wise
independent family of hash functions. Let $\Pr[\cdot]$ denote a fully random
probability measure over $[u] \to [u]$, and let $\Pr_l[\cdot]$ denote any
l-wise independent probability measure over the same domain. We divide the
universe into $|\phi|$ non-overlapping blocks, which will be defined in the
next section.

Lemma 3.1.

$$\Pr[h(y_1), h(y_2), \ldots, h(y_d) < RANK_{k-d+1}(h(X))] = \frac{k}{n} \cdot \frac{k-1}{n-1} \cdots \frac{k-d+1}{n-d+1}$$

Proof. Consider n ordered elements divided into two groups, one of size
$n - d$, and the other of size d. The number of possible locations of the d
elements is $\binom{n}{d}$. There are $\binom{k}{d}$ possible locations in
which the d elements are among the k smallest elements. Hence, the
probability for the d elements to be among the k smallest elements is:

$$\Pr[h(y_1), h(y_2), \ldots, h(y_d) < RANK_{k-d+1}(h(X))] = \frac{\binom{k}{d}}{\binom{n}{d}} = \frac{k}{n} \cdot \frac{k-1}{n-1} \cdots \frac{k-d+1}{n-d+1}$$

□
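
The counting argument of lemma 3.1 can be checked by brute force for small parameters. The sketch below (function names are hypothetical) enumerates every ordering of n elements, counts the orderings in which d designated elements all land among the k smallest, and compares the result with the binomial and product forms, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

def prob_by_enumeration(n, k, d):
    """Exact probability under a uniformly random ordering of n elements.

    Elements 0..d-1 play the role of Y; ranks[y] is the position of y.
    """
    good = total = 0
    for ranks in permutations(range(n)):
        total += 1
        if all(ranks[y] < k for y in range(d)):
            good += 1
    return Fraction(good, total)

def prob_binomial_form(n, k, d):
    return Fraction(comb(k, d), comb(n, d))

def prob_product_form(n, k, d):
    p = Fraction(1)
    for j in range(d):
        p *= Fraction(k - j, n - j)
    return p
```

For n = 6, k = 4 and d = 2 all three functions return 2/5. The enumeration is factorial in n, so it is only meant as a sanity check for small cases.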

Since the blocks in $\phi$ are non-overlapping,
$\sum_{i \in \phi} \Pr_l[RANK_{k-d+1}(h(X)) \in b_i] = 1$. Using lemma 3.1 we
get

$$\Pr[h(y_1), h(y_2), \ldots, h(y_d) < RANK_{k-d+1}(h(X))] = \frac{k}{n} \cdot \frac{k-1}{n-1} \cdots \frac{k-d+1}{n-d+1} \cdot \sum_{i \in \phi} \Pr_l[RANK_{k-d+1}(h(X)) \in b_i]$$



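Lemma 3.2. Let

$$A = \sum_{i \in \phi} \Pr_l[RANK_{k-d+1}(h(X)) \in b_i] \cdot \left( \Pr_l[h(y_1), \ldots, h(y_d) < RANK_{k-d+1}(h(X)) \mid RANK_{k-d+1}(h(X)) \in b_i] - \frac{k}{n} \cdot \frac{k-1}{n-1} \cdots \frac{k-d+1}{n-d+1} \right)$$

Any family of l-wise independent hash functions is approximately d-k-min-wise
independent if

$$-\epsilon \cdot \frac{k}{n} \cdot \frac{k-1}{n-1} \cdots \frac{k-d+1}{n-d+1} \le A \le \epsilon \cdot \frac{k}{n} \cdot \frac{k-1}{n-1} \cdots \frac{k-d+1}{n-d+1}$$

Proof. Based on the law of total probability, in the l-wise independent case

$$\Pr_l[h(y_1), h(y_2), \ldots, h(y_d) < RANK_{k-d+1}(h(X))] = \sum_{i \in \phi} \Pr_l[RANK_{k-d+1}(h(X)) \in b_i] \cdot \Pr_l[h(y_1), h(y_2), \ldots, h(y_d) < RANK_{k-d+1}(h(X)) \mid RANK_{k-d+1}(h(X)) \in b_i]$$

By definition, any family of l-wise independent hash functions is
approximately d-k-min-wise independent if

$$\Pr_l[h(y_1), h(y_2), \ldots, h(y_d) < RANK_{k-d+1}(h(X))] = \frac{\binom{k}{d}}{\binom{n}{d}}(1 \pm \epsilon)$$

which is satisfied if

$$-\epsilon \cdot \frac{k}{n} \cdot \frac{k-1}{n-1} \cdots \frac{k-d+1}{n-d+1} \le A \le \epsilon \cdot \frac{k}{n} \cdot \frac{k-1}{n-1} \cdots \frac{k-d+1}{n-d+1}$$

□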



3.1 Blocks partitioning 

We divide the universe $[0, |U|]$ into non-overlapping blocks. Let
$t = k - d + 1$ and $m = n - d = |X|$. We construct the blocks around
$\frac{t|U|}{m}$ as follows:

$$b_i = \left[ (1 + \epsilon(i-1))\frac{t|U|}{m},\ (1 + \epsilon i)\frac{t|U|}{m} \right)$$

We refer to blocks $b_i$ for $i > 0$ as 'positive blocks' and 'negative
blocks' otherwise ($i < 0$). For the rest of the paper, we ignore blocks which
are outside $[0, |U|]$.
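
As a small illustration of the partition, the sketch below maps a hash value to the index of its block. It assumes the blocks are the half-open intervals $b_i = [(1 + \epsilon(i-1))\frac{t|U|}{m}, (1 + \epsilon i)\frac{t|U|}{m})$, which is our reading of the definition above; the function names and the use of exact rational arithmetic are choices of the sketch.

```python
from fractions import Fraction
from math import floor

def block_bounds(i, universe_size, m, t, eps):
    """Lower (inclusive) and upper (exclusive) boundary of block b_i."""
    center = Fraction(t * universe_size, m)
    eps = Fraction(eps)
    return (1 + eps * (i - 1)) * center, (1 + eps * i) * center

def block_index(v, universe_size, m, t, eps):
    """Index i such that the value v falls inside block b_i."""
    center = Fraction(t * universe_size, m)
    return floor((Fraction(v) / center - 1) / Fraction(eps)) + 1
```

With `universe_size=1000`, `m=100`, `t=10` and `eps=Fraction(1, 4)`, the blocks are centred on 100: the value 100 falls in $b_1 = [100, 125)$ and the value 99 falls in $b_0 = [75, 100)$.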

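3.2 Bounding $\Pr_l[RANK_t(h(X)) \in b_i]$

Lemma 3.3. For $i > 0$, $d > 0$, $\epsilon \in (0, 1)$,
$k > d - 1 + \frac{2 \cdot 8^{2/l}(6l)^{1 + \frac{1}{l}}}{\epsilon^2}$ and
$l \ge 2d + 2$:

$$\Pr_l[RANK_t(h(X)) \in b_i] \le \frac{1}{i^{d+1}}$$

Proof. For block $b_i$ and $X = \{x_1, \ldots, x_m\}$ we define $Z_j$ to be an
indicator variable s.t.

$$Z_j = \begin{cases} 1 & h(x_j) < (1 + \epsilon i)\frac{t|U|}{m} \\ 0 & \text{otherwise} \end{cases}$$

In addition we define $Z = \sum_j Z_j$ and $E_i$ to be the expected value of
Z. Notice that since Z is a sum of indicator variables,
$E_i = (1 + \epsilon i)\frac{t}{m}(m) = t(1 + \epsilon i)$.
We use the above definitions to show that

$$\Pr_l[RANK_t(h(X)) \in b_i] \le \Pr[\text{number of hash values smaller than the lower boundary of block } b_i < t]$$
$$= \Pr[Z < t] = \Pr[E_i - Z > E_i - t] \le \Pr[|E_i - Z| > E_i - t] = \Pr[|Z - E_i| > t(1 + \epsilon i) - t]$$

Using Markov inequality, assuming l is even:

$$\Pr[|Z - E_i| > t \epsilon i] \le \frac{E(|Z - E_i|^l)}{(t \epsilon i)^l}$$

We use the following from lemma A.3:

$$E(|Z - E_i|^l) \le 8(6l)^{\frac{l+1}{2}}(E_i)^{\frac{l}{2}}$$

Thus,

$$\Pr_l[RANK_t(h(X)) \in b_i] \le \frac{8(6l)^{\frac{l+1}{2}}(t(1 + \epsilon i))^{\frac{l}{2}}}{(t \epsilon i)^l}$$

In order to have

$$\Pr_l[RANK_t(h(X)) \in b_i] \le \frac{1}{i^{d+1}}$$

we need

$$\frac{8(6l)^{\frac{l+1}{2}}(t(1 + \epsilon i))^{\frac{l}{2}}}{(t \epsilon i)^l} \le \frac{1}{i^{d+1}}$$

or

$$\frac{8(6l)^{\frac{l+1}{2}}(1 + \epsilon i)^{\frac{l}{2}}}{\epsilon^l i^{l-d-1}} \le t^{\frac{l}{2}}$$

or

$$\frac{8^{\frac{2}{l}}(6l)^{1 + \frac{1}{l}}(1 + \epsilon i)}{\epsilon^2 i^{\frac{2}{l}(l-d-1)}} \le t$$

Choosing $l \ge 2d + 2$ (so that $\frac{2}{l}(l-d-1) \ge 1$) and substituting
variables, we now need to show

$$\frac{8^{\frac{2}{l}}(6l)^{1 + \frac{1}{l}}(1 + \epsilon i)}{\epsilon^2 i} + d - 1 \le k$$

$$\frac{8^{\frac{2}{l}}(6l)^{1 + \frac{1}{l}}}{\epsilon^2 i} + \frac{8^{\frac{2}{l}}(6l)^{1 + \frac{1}{l}}}{\epsilon} + d - 1 \le k$$

which is satisfied for
$k > d - 1 + \frac{2 \cdot 8^{2/l}(6l)^{1 + \frac{1}{l}}}{\epsilon^2}$. □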



As the proof for negative blocks is similar, we leave it to the appendix. See
lemma A.1, as follows:

Lemma A.1. For $i > 0$, $d > 0$, $\epsilon \in (0, 1)$,
$k > d - 1 + \frac{2 \cdot 8^{2/l}(6l)^{1 + \frac{1}{l}}}{\epsilon^2}$ and
$l \ge 2d + 2$:

$$\Pr_l[RANK_t(h(X)) \in b_{-i}] \le \frac{1}{i^{d+1}}$$



3.3 Bounding A 

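In this section we prove the upper and lower bounds of lemma 3.2, i.e. that
$-\epsilon\frac{\binom{k}{d}}{\binom{n}{d}} \le A \le \epsilon\frac{\binom{k}{d}}{\binom{n}{d}}$.
As d = 2 is the main use case, and since the proof for general d is too
technical, we focus on the former, and leave the extended proof for the full
paper.

We define $\Pr_l[RANK_t(h(X)) \in b_{\ge j}]$ by
$\Pr_l[\bigcup_{i=j}^{\infty} RANK_t(h(X)) \in b_i]$ and similarly
$\Pr_l[RANK_t(h(X)) \in b_{\le j}]$ by
$\Pr_l[\bigcup_{i=-\infty}^{j} RANK_t(h(X)) \in b_i]$.

Lemma 3.4. For $d = 2$, $\epsilon \in (0, 1)$,
$k > d - 1 + \frac{2 \cdot 8^{2/l}(6l)^{1 + \frac{1}{l}}}{\epsilon^2}$ and
$l \ge 2d + 2$:

$$A \le \epsilon\frac{\binom{k}{d}}{\binom{n}{d}}$$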

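Proof.

$$A = \sum_{i \in \phi} \Pr_l[RANK_t(h(X)) \in b_i] \cdot \left( \Pr_{l+d}[h(y_1), h(y_2), \ldots, h(y_d) < RANK_t(h(X)) \mid RANK_t(h(X)) \in b_i] - \frac{k}{n} \cdot \frac{k-1}{n-1} \right)$$

Using d independent (out of l + d) for $h(y_1), h(y_2), \ldots, h(y_d)$, and
noting that within block $b_i$ the value $RANK_t(h(X))$ lies below the upper
boundary $(1 + \epsilon i)\frac{t|U|}{m}$, we get

$$A \le \sum_{i \in \phi} \Pr_l[RANK_t(h(X)) \in b_i] \cdot \left( \left( \frac{t}{m}(1 + \epsilon i) \right)^2 - \frac{k}{n} \cdot \frac{k-1}{n-1} \right)$$
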

J2 Pr[RANKt{h{X)) G b,] 

i— — oo 

PT[RANKt{h{X)) G bo 



t f'V 



Pv[RANKt{h{X)) G &i] 



^Pr [RAN Ktih{X)) G b. 



By changing the order we get a telescoping sum as follows: 



PT[RANKt{h{X)) e b< 



PY[RANKt{h{X)) e b<o] 



i-r 

m 





( ^^ 



Pr[RANKt{h{X)) e &>i] 



i-ni + ey 

m 







^Pr [RANKt{h{X))&h>,] 



(l)'i(l + «)^_(l)<i(l + e(z-l))'^ 
m TO 



< 



Applying lemma 3.3 and lemma A.1, and bounding the probabilities of blocks
$b_0$ and $b_1$ with 1:



E + «)' - + <^ + i))'i+ 



^ 1 + 

1=2 



We now substitute d 



E^I(^^)V(2^-1)-2.))| + 



n — 2 n n — 1 



, k . 1 k k 1, 

-fil + ef 



n - 2' 



nn — I 



°o 1 h — ^ 

Eil(^)'(^'(2^-l) + 2^))l^^ 



1=2 



2-^E4jl^'(2*-l)-2e|- 
n n - 1 ^ H 

i— 1 ' ' 



, k 1\9 /bA^ 1. 

( o) t\ 

n — 2 n n ^ 1 



, k l,--), ,n k k 1, 



n - 2' 



nn — I 



n n — 1 ^ — ' i'' 

1=2 



c -e 

nn — 1 

□ 

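The proof for negative blocks is similar, hence we leave it to the appendix.
See lemma A.2, as follows:

Lemma A.2. For $d = 2$, $\epsilon \in (0, 1)$,
$k > d - 1 + \frac{2 \cdot 8^{2/l}(6l)^{1 + \frac{1}{l}}}{\epsilon^2}$ and
$l \ge 2d + 2$:

$$-A \le \epsilon\frac{\binom{k}{d}}{\binom{n}{d}}$$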

Theorem 3.1. For d 2, e G (0, 1), e' = f , fc > d - 1 + 2 • St^^^I^ and 
I > 3d + 2, Any l-wise independent family of hash functions is approximately 
d-k-min-wise (e-d-k-min-wise ). 

Proof. Applying lemma 3.4 and lemma A.2 to lemma 3.2 concludes the proof.

□ 

Theorem 3.2. For d < ^, e e (0,1), e' = ^ , k > d - 1 + 2 ■ St^^^I^ and 
I > 3d + 2, Any l-wise independent family of hash functions is approximately 
d-k-min-wise (e-d-k-min-wise ). 

Proof. Applying the generalization of lemma 3.4 and lemma A.2 to lemma 3.2
concludes the proof. Due to lack of space we omitted part of the
generalizations, which will be given in the full paper.

□ 
A Appendix

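Lemma A.1. For $i > 0$, $d > 0$, $\epsilon \in (0, 1)$,
$k > d - 1 + \frac{2 \cdot 8^{2/l}(6l)^{1 + \frac{1}{l}}}{\epsilon^2}$ and
$l \ge 2d + 2$:

$$\Pr_l[RANK_t(h(X)) \in b_{-i}] \le \frac{1}{i^{d+1}}$$

Proof. For block $b_{-i}$ and $X = \{x_1, \ldots, x_m\}$ we define $Z_j$ to be
an indicator variable s.t.

$$Z_j = \begin{cases} 1 & h(x_j) < (1 - \epsilon i)\frac{t|U|}{m} \\ 0 & \text{otherwise} \end{cases}$$

In addition we define $Z = \sum_j Z_j$ and $E_{-i}$ to be the expected value
of Z. Notice that since Z is a sum of indicator variables,
$E_{-i} = (1 - \epsilon i)\frac{t}{m}m = (1 - \epsilon i)t$.
We use the above definitions to show that

$$\Pr_l[RANK_t(h(X)) \in b_{-i}] \le \Pr[\text{number of hash values smaller than the upper boundary of block } b_{-i} \ge t]$$
$$= \Pr[Z \ge t] = \Pr[Z - E_{-i} \ge t - E_{-i}] \le \Pr[|Z - E_{-i}| \ge t \epsilon i]$$

Using Markov inequality, assuming l is even:

$$\Pr[|Z - E_{-i}| \ge t \epsilon i] \le \frac{E(|Z - E_{-i}|^l)}{(t \epsilon i)^l}$$

We use the following from lemma A.3:

$$E(|Z - E_{-i}|^l) \le 8(6l)^{\frac{l+1}{2}}(E_{-i})^{\frac{l}{2}}$$

Thus,

$$\Pr_l[RANK_t(h(X)) \in b_{-i}] \le \frac{8(6l)^{\frac{l+1}{2}}(t(1 - \epsilon i))^{\frac{l}{2}}}{(t \epsilon i)^l}$$

In order to have

$$\Pr_l[RANK_t(h(X)) \in b_{-i}] \le \frac{1}{i^{d+1}}$$

we need

$$\frac{8(6l)^{\frac{l+1}{2}}(t(1 - \epsilon i))^{\frac{l}{2}}}{(t \epsilon i)^l} \le \frac{1}{i^{d+1}}$$

or

$$\frac{8^{\frac{2}{l}}(6l)^{1 + \frac{1}{l}}(1 - \epsilon i)}{\epsilon^2 i^{\frac{2}{l}(l-d-1)}} \le t$$

Choosing $l \ge 2d + 2$ and substituting variables, we now need to show

$$\frac{8^{\frac{2}{l}}(6l)^{1 + \frac{1}{l}}}{\epsilon^2 i} - \frac{8^{\frac{2}{l}}(6l)^{1 + \frac{1}{l}}}{\epsilon} + d - 1 \le k$$

which is satisfied for
$k > d - 1 + \frac{2 \cdot 8^{2/l}(6l)^{1 + \frac{1}{l}}}{\epsilon^2}$.

□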



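Lemma A.2. For $d = 2$, $\epsilon \in (0, 1)$,
$k > d - 1 + \frac{2 \cdot 8^{2/l}(6l)^{1 + \frac{1}{l}}}{\epsilon^2}$ and
$l \ge 2d + 2$:

$$-A \le \epsilon\frac{\binom{k}{d}}{\binom{n}{d}}$$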



Proof. 



J2 Pr^[RANKk-iihiX)) e b,] 



Pr [hiyi),h{y2) < RAN Kk-i{h{X)) \ RAN Kk-i{h{X)) G h] - 

i+d nn — 1 



> 



We now use the lower part of the block to bound the probability. Using 2
independent (out of l + d) for $h(y_1), h(y_2), \ldots, h(y_d)$, we get



^ Vv[RANKk-i{h{X)) G hi+i] 



/k ..n k k 1 

( ^) (1 + «) 7 



Pr[RANKk-i{h{X))€bi+i] 



kk-1 



PT[RANKk-i{h{X))ebo] 



nn — 1 

kk-1 



+ 



\-2' ^ ' nn-1 
By changing the order we get a telescoping sum as follows:
Applying lemma 3.3 and lemma A.1
Lemma A.3. Let $Z_j$ be a set of indicator variables, let $Z = \sum_j Z_j$,
let $E_i$ be the expected value of Z, and let $l > 0$ be even. Then

$$E(|Z - E_i|^l) \le 8(6l)^{\frac{l+1}{2}}(E_i)^{\frac{l}{2}}$$

Proof. The proof is based on Indyk's lemma 2.2 in [23], with the following
minor change:



E{\Z - < 2j2ij' ■ 2e"^^-) < 4{3E,)^ ^(s' 



